Homework 08

Problem 0

Using the tidyverse style guide.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(viridis)

eujingc_315_theme <-  theme_bw() +
  theme(axis.text = element_text(size = 12),
        plot.title = element_text(size = 20, face = "bold", hjust = 0),
        plot.subtitle = element_text(size = 14, face = "italic", hjust = 0),
        text = element_text(size = 14, face = "bold", color = "darkslategrey"))


Problem 1

library(MASS)
library(tidyverse)
library(GGally)
data(Cars93)
cont_cols <- which(names(Cars93) %in%
                     c("Cars93", "Price", "MPG.city",
                       "MPG.highway", "EngineSize",
                       "Horsepower", "RPM", "Fuel.tank.capacity", "Passengers",
                       "Length", "Wheelbase", "Width", "Turn.circle", "Weight"))

ggparcoord(Cars93, columns = cont_cols, showPoints = TRUE, alphaLines = 0.3) +
    aes(color = factor(Type)) +
    labs(color = "Car Type", x = "Variables", y = "Value",
         title = "Parallel Coordinates of Continuous Variables for Cars",
          subtitle = "Centered and standardized by standard deviation") +
    scale_color_viridis(discrete = TRUE) +
    eujingc_315_theme +
    theme(axis.text.x = element_text(angle = 45))

  1. Each line in the plot corresponds to an observation in our data, with the x-axis corresponding to each continuous variable and the y-axis corresponding to the standardized score of the continuous variable for each observation.

    We can see that cars of type 4 tend to have higher mileage than the other types, while cars of type 6 tend to accomodate more passengers.

  2. The parallel coordinates plot is easier to read as we can follow each line that corresponds to each observation more easily.

ggparcoord(Cars93, columns = cont_cols, showPoints = TRUE, alphaLines = 0.3) +
    aes(color = factor(Type)) +
    coord_polar() +
    labs(color = "Car Type", x = "Variables", y = "Value",
         title = "Parallel Coordinates of Continuous Variables for Cars",
          subtitle = "Centered and standardized by standard deviation") +
    scale_color_viridis(discrete = TRUE) +
    eujingc_315_theme +
    theme(axis.text.x = element_text(angle = 45))

  1. The default y-axis is centered and standardized by standard deviation. We can use scale = "center".
ggparcoord(Cars93, columns = cont_cols, showPoints = TRUE, alphaLines = 0.3,
           scale = "center") +
    aes(color = factor(Type)) +
    labs(color = "Car Type", x = "Variables", y = "Value",
         title = "Parallel Coordinates of Continuous Variables for Cars",
          subtitle = "Scaled by range and centered by mean") +
    scale_color_viridis(discrete = TRUE) +
    eujingc_315_theme +
    theme(axis.text.x = element_text(angle = 45))

  1. City mileage is positively correlated with highway mileage. We can see how cars with high city mileage with positive scaled values corresponding with high highway mileage with positive scaled values too.

    The highway mileage is negatively correlated with the engine size. We can see how cars with high mileage having positive scaled values have negative scaled values for engine size.



Problem 2

cars_cont <- dplyr::select(Cars93, Price, MPG.city, MPG.highway, EngineSize,
                           Horsepower, RPM, Fuel.tank.capacity, Passengers,
                           Length, Wheelbase, Width, Turn.circle, Weight)
library(reshape2)
correlation_matrix <- cor(cars_cont)
melted_cormat <- melt(correlation_matrix)
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "darkred", mid = "lightgrey", high = "darkblue") +
  labs(title = "Correlation of Variable Pairs",
       subtitle = "For continuous variables only",
       fill = "Correlation") +
  eujingc_315_theme +
  theme(axis.text.x = element_text(angle = 90),
        axis.title.x = element_blank(), axis.title.y = element_blank())

  1. City and highway mileage are very highly positively correlated, along with fuel tank capacity and weight, and width and weight. Weight is also very highly negatively correlated with city and highway mileage, along with fuel tank capacity and city and highway mileage. There is approximately no correlation between passengers and horsepower, passengers and price, RPM and price, and RPM and horsepower.

  2. This plot is related to a heat map because we map high a low values with some color gradient, and then visualize the values using the mapped colors.

  3. This plot reminds me of a mosaic plot for categorical variables.

correlation_matrix <- cor(cars_cont)
dd <- as.dist((1 - correlation_matrix) / 2)
hc <- hclust(dd)
correlation_matrix <- correlation_matrix[hc$order, hc$order]
correlation_matrix[lower.tri(correlation_matrix)] <- NA

melted_cormat <- melt(correlation_matrix)
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), size = 3) +
  scale_fill_gradient2(low = "darkred", mid = "lightgrey", high = "darkblue",
                       limits = c(-1, 1)) +
  labs(title = "Correlation of Variable Pairs",
       subtitle = "For continuous variables only",
       fill = "Correlation") +
  eujingc_315_theme +
  theme(axis.text.x = element_text(angle = 90),
        axis.title.x = element_blank(), axis.title.y = element_blank())



Problem 3

(20 points)

Variable Dendrograms

Another way to visually explore potential associations between continuous variables in our dataset is with dendrograms.
But we will compute a dendogram on features instead of data points. In other words, we cluster vraiables rather than cars.

  1. (15 points) Create a “variable dendrogram” of the continuous variables in the Cars93 dataset. To do this:
  • Select the continuous variables from the dataset
  • Compute the correlation matrix for these variables
  • Correlations measure similarity and can be negative, while distances measure dissimilarity and cannot be negative. As such, convert your correlations to instead be one minus the absolute value of the correlations, so that correlations near 1 or -1 will have distances of 0, and correlations near 0 will have distances of 1, e.g.: cormat <- 1 - abs(cormat)
  • Convert your transformed correlation matrix to a distance matrix with the as.dist() function.
  • Submit this distance matrix to hierarchical clustering (hclust()), convert the result to a dendrogram (as.dendrogram()), then plot with ggplot().
  • Color the branches by the four-cluster solution. See the link in HW07 for how to do this.
  • Be sure to adjust your axis labels, add a title, etc.
  • The resulting dendrogram should plot highly correlated variables (positively or negatively correlated) in the same branches / clusters in the dendrogram, while uncorrelated variables will be linked at higher “distances” on the dendrogram.
library(dendextend)

cormat <- cor(cars_cont)
cormat <- 1 - abs(cormat)
cormat <- as.dist(cormat)
hc <- hclust(cormat)
dg <- as.dendrogram(hc)
dg <- dg %>% set("labels_col", cutree(hc, k = 4))

ggplot(dg)

  1. (5 points) Examine the four-cluster solution. Which variables are in the same cluster? Does it make sense that these are in the same cluster, given both your common-sense understanding of these variables and given the correlation plot you created in Problem 2?

  2. We could use euclidean distances between the mean of each continuous variable.

If you’re finding your graphic runs over the boundaries, try standard approaches of ylim and xlim changes. Additionally, sometimes the knitted file looks different - knit once before you run.



Problem 4

Use this command

g = barabasi.game(1000,power=1,directed=FALSE)

to create a random graph that is similar to some real social networks.

library(igraph)
g <- barabasi.game(1000, power = 1, directed = FALSE)
  1. (2 points) Plot the graph using the plot command in igraph. Try different layouts and try adjusting the parameters of the plot until you get a nice plot. Use help(plot.igraph) to see the options. Your homework should only include one graph.
plot(g, layout = layout_nicely)

  1. (2 points) Now plot the graph using ggraph. Again, choose the options carefully.
library(ggraph)
ggraph(g, layout = "stress") +
    geom_node_point() +
    geom_edge_link()

  1. (2 points) Compute and report the key summary statistics of the graph.

  2. (6 points) Try two different igraph clustering algorithms. Explain what they are doing. Show the resulting plots and say what the modularity is. Which algorithm do you think worked better and why?

  3. (2 points) Draw a histogram of betweeness(g). What is the interpretation of this?



Problem 5

data(iris)
X <- log(iris[, 1:4])
  1. The options center and scale ensure that all the variables are centered (subtracted mean) and are divided by their standard deviation so that every variable has the same scale.
out <- prcomp(X, center = TRUE, scale = TRUE)
  1. The biplot shows the 2D projection of the data with the first two principal components from PCA. The arrows then represent the variables in this new projection.
biplot(out)

  1. The principal components seem to seperate the species well, except for versicolor and virginica which are not well separated.
ggplot(as.data.frame(out$x[, 1:2]),
       aes(x = PC1, y = PC2, color = iris$Species)) +
    geom_point() +
    labs(x = "PC1", y = "PC2", color = "Species",
         title = "2D Projection of Iris Data",
         subtitle = "Using First 2 Principal Components") +
    scale_color_viridis(discrete = TRUE) +
    eujingc_315_theme

  1. (4 points) Repeat the analysis using kernel pca. I suggest using the kpca functiuon in the kernlab library.
library(kernlab)

outk <- kpca(X)
  1. (2 points) Now repeat the analysis using tsne. Try perplexity = 30 and perplexity = 5. What happens as you change the perplexity?

  2. (2 points) Load the mlbench library and use this command:

X = 4*mlbench.spirals(100,1,.025)$x

Plot the data.

  1. (4 points) Use the command

sc = specc(X, centers=2)

from kernlab to perform spectral clustering. Plot the data and use two colors to indicate the result of the spectral clustering. Now repeat using kmeans clustering. Why did the spectral clustering work better?

Problem 6

(1 point each)

Arc Pie Charts

Install and load the ggforce package. This package implements several updates and improvements to ggplot2.

  1. Create an “arc pie chart” of the Type variable in the Cars93 dataset. (Code provided.)
library(ggforce)
Cars93 %>% group_by(Type) %>% 
  summarize(count = n()) %>% 
  mutate(max = max(count),
         focus_var = 0.2 * (count == max(count))) %>%
  ggplot() + geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0.8, r = 1, 
                              fill = Type, amount = count), 
                          stat = 'pie')

  1. Adjust the r0 parameter to lower and higher values. What does this control? What is the minimum and maximum value?

  2. Recreate the graph from (a), but this time, add explode = focus_var into your call to aes(). What does this do?

  3. Recreate the graph from (c), but this time, add focus to the category with the minimum number of observations.

  4. Critique these graphs.



Problem 7

(5 points each)

Zoom

See the following code working with the IMDb movies dataset for how to use facet_zoom().

library(tidyverse)
library(forcats)
library(devtools)
library(ggforce)

#  Colorblind-friendly color pallette
my_colors <- c("#000000", "#56B4E9", "#E69F00", "#F0E442", "#009E73", "#0072B2", 
               "#D55E00", "#CC7947")

#  Read in the data
imdb <- read_csv("https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/imdb_test.csv")

# get some more variables
imdb <- mutate(imdb, profit = (gross - budget) / 1000000,
               is_french = ifelse(country == "France", "Yes", "No")) %>%
  filter(movie_title != "The Messenger: The Story of Joan of Arc")
france_1990 <- filter(imdb, country == "France", title_year >= 1990)

# this code plots a scatterplot + a zoomed facet
ggplot(data = imdb, aes(x = title_year, y = profit)) + 
  geom_point(color = my_colors[1], alpha = 0.25) + 
  geom_smooth(color = my_colors[2]) + 
  geom_point(data = france_1990, color = my_colors[3]) + 
  geom_smooth(data = france_1990, aes(x = title_year, y = profit), 
              color = my_colors[4], method = lm) + 
  facet_zoom(x = title_year >= 1990) + 
  labs(title = "Movie Profits over Time",
       subtitle = "Zoom:  French Movies from 1990 -- 2017 (orange/yellow)",
       caption = "Data from IMDB and Kaggle",
       x = "Year of Release",
       y = "Profit (millions of USD)")

Also read the articles here, or here.

  1. Recreate any scatterplot that we created throughout the year, and zoom in on a section of the graph via the facet_zoom() feature in the newest version of the ggforce package. Include a title, subtitle, and caption in the resulting graph. The caption should just state the data source, and the subtitle should explain what area of the plot is being enhanced via zooming.

  2. Interpret the resulting graph: Describe some feature of the new version of the graph that you may not have been able to see very well in the previous version of the same graph (without zooming).